In preparation for the following hands-on experiences we will set up an R environment which has everything we need.
suppressPackageStartupMessages({
library(Seurat)
library(ggplot2)
library(tidyr)
library(scDblFinder)
library(SingleCellExperiment)
})
For this tutorial we will use raw unprocessed data from human peripheral blood mononuclear cells (PBMCs) generated using the 10X Genomics platform which can be download from the Seurat page. These data should be downloaded, extracted, and placed in the downloads folder for this workshop.
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "./downloads/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
seurat_object <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
min.cells = 3, min.features = 200)
Note: We apply some preliminary filters as we know that GEMs which very low numbers of genes or genes which are barely detected are unlikely to be actual cells.
print(seurat_object)
To better understand how to work with single-cell/nuclei RNA sequencing data we will take a deep dive into how datasets are typically structured. In the diagram below we illustrate how droplet-based scRNAseq approaches separate cells and their mRNAs. But how does this translate into data?
# Raw counts in a seurat object is saved in the assays-RNA-counts slot
raw_counts = as.matrix(seurat_object@assays$RNA@layers$counts)
rownames(raw_counts) = rownames(seurat_object)
colnames(raw_counts) = colnames(seurat_object)
# Lets look at the first 4 columns and rows
print(raw_counts[1:4,1:4])
What does this mean?
Columns: The columns represent a unique
barcode. As we can see below, this barcode
marks an individual cell which is how we can identify each one. So in
this object, the columns are the cells
Rows: We see that the first few rows have names such as RP11-206L10.9. A quick google search will tell you that this is a human gene (Ensembl ID - ENSG00000237491). So we see in this object that the rows represent the genes.
Single-cell transcriptomic analysis tools (e.g., Seurat) does a lot of the heavy lifting for us when it comes to counts and other summary data information. For example, the following code will tell you the number of genes and cells, as well store raw, normalized, and scaled data, dimension reductions, and other analyses which may have been performed.
print(seurat_object)
This indicates that we have 13,714 genes (features) across 2,700 samples.
However, we can also extract that information from the count table:
print(paste0('Cells: ', ncol(raw_counts)))
print(paste0('Features:', nrow(raw_counts)))
These numbers should match!
# Calculate the total reads in each individual cell
read_depths = colSums(raw_counts)
# How are they distributed?
hist(read_depths, main = "Histogram of read depth",
xlab = "Depth", ylab = "Frequency",
breaks = 200)
This histogram shows that the number of mRNA transcripts detected in each cell ranged from 546 to 15,818, but that most were around 2,000. So what might account for these differences?
Cell types: Some cell types may simply just express more mRNA than others. For example, Hepatocytes are known to contain a lot of RNA while others may not.
Stochastic gene expression: Fluctuations in transcription and degradation from normal biological processes, cell states, and other factors in individual cells which have very low amounts of RNA to begin can appear as big differences.
Technical variability: Sequencing depth, assay kits, and other technical factors may contribute to these differences.
The zero-inflation debate: In the early days of scRNAseq it was a widely accepted that data was zero-inflated, i.e., it had more zero’s than it should because of technical challenges. Today, we better understand that the large number of zero’s are not only technical, but in fact does reflect true biology as well. The following manuscript delves deeper into the debate: Jiang et al., 2022
Sequencing depth ultimately drives the ability to detect individual genes in a cell or sample (bulk RNA seq). As more reads are performed on a sample, the probability of capturing a lower abundance gene increases. Therefore, sequencing depth can have a significant impact on the end results of your sequencing results.
So why not sequence as deep as possibe? As with anything, we need to balance cost with value. The following figures are standard outputs from scRNAseq reports including those linked above.
As you can see, the more reads we have per cell, the smaller the slop for capturing new genes gets. Therefore, the difference between capturing 80% of genes and 100% could mean 2, 10, or even >100X the amount of sequencing. Is it truly worth it for capturing a gene who expression is so low? This is why we limit our depth for sequencing and why most datasets produced with the sample platform will tend to detect a similar number of genes.
We previously mentioned that single-cell/nuclei transcriptomic data has a lot of zeros. Lets actually look at the count distribution so you can see:
# We wan to collapse all the values into a vector of values
mat_values <- as.vector(raw_counts)
# Plot histogram
hist(mat_values, main = "Histogram of Matrix Values", xlab = "Value", ylab = "Frequency", breaks = 600, xlim = c(0,10))
This is why you will almost always find raw scRNAseq data saved as a
.mtx (sparse matrix) file or format.
Here we will download publicly accessible sparse matrices from the Gene Expression Omnibus, an excellent resource for obtaining raw data if you want to reuse and reanalyze existing data. This dataset is from single-nuclei RNA sequencing of livers from mice treated with the environmental contaminant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD).
#download.file('https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE184506&format=file&file=GSE184506%5Fsct%2Emtx%2Egz', './downloads/matrix.mtx.gz')
system("gunzip ./downloads/matrix.mtx.gz")
#download.file('https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE184506&format=file&file=GSE184506%5Fsct%2Ebarcodes%2Etsv%2Egz', './downloads/barcodes.tsv.gz')
system("gunzip ./downloads/barcodes.tsv.gz")
#download.file('https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE184506&format=file&file=GSE184506%5Fsct%2Egenes%2Etsv%2Egz', './downloads/genes.tsv.gz')
system("gunzip ./downloads/genes.tsv.gz")
# Read the first few lines of the file
lines <- readLines('./downloads/matrix.mtx', n = 10)
# Print the lines
cat(lines, sep = "\n")
As you can see, there is not a zero in sight here. After you read the first 2 summary lines you can see the following structure:
| Feature/Gene | Barcode/Cell | Count/Value |
|---|---|---|
| Refers to the row number in the genes.tsv file | Refers to the row number in the barcodes.tsv file | Include all non-zero values of the dataset |
Let’s see what this is here
# Read the specific line of the file
readLines('./downloads/genes.tsv', n = 15)[15]
# Read the specific line of the file
readLines('./downloads/barcodes.tsv', n = 1)[1]
From here we can see that the first row of data in the
mtx file represent the gene Sntg1
for the cell with barcode
AAACCCACAACTTCTT_1.
Seurat objects are probably the most commonly used to store and organize scRNAseq data for R users. So far this is what we have used for the published PBMC data which we imported in Section 2. Let’s take another closer look at this object:
print(slotNames(seurat_object)) # Lets examine all the slots
There is a lot of information, but the ones you will most commonly
use if you do single-cell analyses in R are assays,
meta.data, and reductions
Assays will hold the count data
Meta.data will hold the metadata for each individual cell/nuclei
Reductions will hold the dimension reductions for visualizations
Another common format in R are SingleCellExperiments. They contain the same information, but are organized in a different manner.
First, let’s convert our seurat object to SingleCellExperiment. Most formats can easily be interchanged:
sce = as.SingleCellExperiment(seurat_object)
Next, lets see what the structure looks like:
print(sce)
Here we can see:
rownames and rowData contain the gene information and metadata
colnames and colData contain the cell information and metadata
assays here also hold the count data
We can examine the pieces of data such as the cell metadata as follows:
print(colData(sce))
This workshop will not go into python resources, but it is imporant to note that Python and R are the two most common programming languages for single-cell transcriptomic analysis and it is not uncommon to have to use both. The H5ad file format is similar to the Seurat object and SingleCellExperiment object in the sense that it stores the different components of the data in an organized manner.
Any analysis, no matter of size, should undergo rigorous QC. When it comes to single-cell transcriptomic data, there are several computational tools that help you in the process from enabling visualizations to performing a panel of QC tests and reporting several different metrics for the end user to evaluate. A list of just some of these tools can be found here.
Let’s go through some of the most common QC metrics, what to they mean, and why do they matter?
How does the number of detected genes and transcripts indicate quality? While we don’t have a known number of expressed genes for each cell type, we do know that cells require the expression of hundreds to thousands of genes to function properly.
Number of genes (features)
We can check these values by running the code below:
VlnPlot(seurat_object, features = 'nFeature_RNA') + NoLegend()
Here we can see that (1) the number of genes detected ranges from ~200 - 3000 and (2) most express ~900 genes. Is this normal and expected? In reality, there is no universal threshold, these values will depend on the cell type and the sequencing depth. But we do know that if most cells had extemely low counts that either cells were not healthy/intact or sequencing depth was much too low. It could also point to errors further upstream such as alignment to the incorrect reference genome.
Number of mRNA molecules (count)
We can also examine the total number of mRNA molecules detected:
VlnPlot(seurat_object, features = 'nCount_RNA')
As above, we can see a good distribution centered around 2500 mRNA molecules. These value should always be higher than the number of features.
Correlation of features and molecules
A third way to look at this data is to examine the correlation of features and mRNA molecules:
FeatureScatter(object = seurat_object, feature1 = 'nCount_RNA', feature2 = 'nFeature_RNA')
This metric is more informative because it tells you that as you detect more mRNA molecules, you also detect more different genes. This would generally be expected as you are getting a larger sampling of the mRNA resulting in an increased chance of sampling different genes. However, when you assess these plots, you will want to keep the biology in mind.
Here, biological knowledge is extremely important for evaluating this metric and removing cells based on a threshold. Consider these three experiments:
Nuclei isolated from muscle cells which require a lot of energy
Muscle whole cells which require a lot of energy
Skin cells which are less metabolically active
What would you expect to see (qualitative) in the amount of mitochondrial genes for each of these experiments?
Assessing percent mitochondria counts
We can assess the percentage of mitochondrial genes by calculating the number of counts coming from mitochondrial genes (usually prefixed with mt- or MT-) compared to the total counts. This can be visualized like this:
# Calcuate the percentage of mitochondrial counts
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")
# Plot the values
VlnPlot(seurat_object, features = 'percent.mt')
We see that the percentage of counts representing mitochondrial genes is mostly below 5% but as high as ~25%. To evaluate these results think back to the three experiments and consider what you would expect to see:
Nuclei isolated from muscle cells which require a lot of energy
Despite being metabolically active, nuclei isolation is expected to wash out mitochondria. High contamination is indicative of high ambient RNA or mitochondria sticking to the the nuclei.
Muscle whole cells which require a lot of energy
Given that muscle cells are highly metabolically active and are known to have among the most mitochondria among mammalian cells, these data may show higher percentage of mitochondrial genes than most datasets.
Skin cells which are less metabolically active
Given that these are less metabolically active, they may be expected to show similar or lower percentage of mitochondrial genes.
Correlation of counts and mitochondrial counts
As with the features and counts, we can look at how these correlate:
FeatureScatter(seurat_object, feature1 = 'nCount_RNA', feature2 = 'percent.mt')
Interestingly, these are not correlated at all. Prior to filtering, these values will be anti-correlated as mitochondria often makes up the ambient RNA (release from dead or lysed cells) and will be more represented in the background (low count cells).
As with any single-cell transcriptomic tool, there are dozens, if not more, options for removal of abient RNA. Generally, the concept for ambient RNA removal is the same - substract genes which are present in the empty GEMs/wells from those which are found to have cells. Other techniques such as combinatorial barcoding approaches have to make different assumptions.
SoupX
is one example of a tool which cleans up the soup (floating
ambient RNA) from cells. This tools uses the most raw format of data
which is generate before a threshold is set on the knee plot discussed
above. For 10X Genomics these are stored in the
raw_feature_bc_matrix and
filtered_feature_bc_matrix folders.
This tutorial will not go through the entire ambient RNA removal workflow. Excellent examples can be found here and here.
Another common challenge with using single-cell transcriptomic platforms is one that can affect most single-cell based techniques including flow cytometry - doublets. These can be a result of cells sticking to each other or being too close to each other during the sorting, and the rate is typically a function of the total number of cells examined (more cells results in more doublets).
Note: We will discuss dimension reduction which is central to this approach in a later module.
Doublets can consist of either 2 cells of the same type (homotypic) or different cell types (heterotypic). As you can probably imagine, distinguising homotypic doublets is more challenging, but are generally considered to be less of a concern.
DoubletFinder and scDblFinder are R based tool for the identification of doublets based on the general principle outlined above. This example uses scDblFinder:
sce = as.SingleCellExperiment(seurat_object) # We need to convert the Seurat format to SingleCellExperiment format
sce <- scDblFinder(sce) # Run scDblFinder
Next we will transfer the information gained from this analysis into the Seurat object so that we can visualize it a later step.
# Transfer the doublet information to the seurat object
seurat_object$scDblFinder.class = colData(sce)$scDblFinder.class
seurat_object$scDblFinder.score = colData(sce)$scDblFinder.score
Ultimately, the purpose of the QC analyses is not only to understand your data and find any major issues. It is also used to identify individual data points which are likely problematic and could negatively impact subsequent analyses. Because we often have thousands of data points, we generally accept that some of these data points are technical noise which can simply be removed. For example, here we:
Remove cells with extremely high or low counts as either empty droplets or multiplets.
Remove cells with more than 5% of reads from mitochondrial genes as an indicator of unhealthy cells.
Remove cells classified as doublets.
seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 & scDblFinder.class == 'singlet')
print(seurat_object)
Finally we want to address the issue of unequal sequencing depth and other technical noise by performing normalization. Here we normalize the data using a regularized negative binomial regression which models adjusts the counts for each gene based on the total number of counts within a cell with adjustments for lowly and highly expressed genes as described here.
seurat_object = SCTransform(seurat_object, verbose = F)
Finally we want to visualize the data. To do this we need to perform dimension reduction which will be covered in a later module. We will run it here for visualization purposes.
seurat_object = RunPCA(seurat_object, verbose = F)
seurat_object = RunUMAP(seurat_object, dims = 1:30, verbose = F)
FeaturePlot(seurat_object, features = c('scDblFinder.score', 'percent.mt', 'nFeature_RNA', 'nCount_RNA'))
6. Dimension reduction and clustering; 7. Annotation of cell types; 8. Trajectories and RNA velocity; 9. Inferring cell-cell communication; 10. Spatial transcriptomics